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^ ■ We present the results of gravitational direct A^-body simulations using the com- 

', mercial graphics processing units (GPU) NVIDIA Quadro FX1400 and GeForce 

8800GTX, and compare the results with GRAPE-6Af special purpose hardware. 
The force evaluation of the A-body problem was implemented in Cg using the 
GPU directly to speed-up the calculations. The integration of the equations of mo- 
tions were, running on the host computer, implemented in C using the 4th order 



2 ' predictor-corrector Hermite integrator with block time steps. We find that for a 

c/3 . large number of particles (A ^ 10 ) modern graphics processing units offer an at- 

^ ' tractive low cost alternative to GRAPE special purpose hardware. A modern GPU 

continues to give a relatively flat scaling with the number of particles, compara- 
^ . ble to that of the GRAPE. The GRAPE is designed to reach double precision, 

^ I whereas the GPU is intrinsically single-precision. For relatively large time steps, 

the total energy of the N-body system was conserved better than to one in 10^ on 
the GPU, which is impressive given the single-precision nature of the GPU. For the 
same time steps, the GRAPE gave somewhat more accurate results, by about an 
order of magnitude. However, smaller time steps allowed more energy accuracy on 
the grape, around 10~^^, whereas for the GPU machine precision saturates around 
10-^ For N > 10^ the GeForce 8800GTX was about 20 times faster than the host 
computer. Though still about a factor of a few slower than GRAPE, modern GPUs 
outperform GRAPE in their low cost, long mean time between failure and the much 
larger onboard memory; the GRAPE-6Af holds at most 256k particles whereas the 
GeForce 8800GTX can hold 9 million particles in memory. 
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1 Introduction 



Since tlie first large scale simulations of self gravitating systems the direct 
A^-body method has gained a solid footing in the research community. At 
the moment A^-body techniques are used in astronomical studies of plane- 
tary systems, debris di scs, stellar clusters, galaxies all the way to simulations 
of the entire universe (IHutl . 120071 ) . Outside astronomy the main areas of re- 
search which utilise the same techniques are molecular dynamics, elementary 
particle scattering simulations, plate tectonics, traffic simulations and chemi- 
cal reaction network studies. In the latter non-astronomical applications, the 
main force evaluating routine is not as severe as in the gravitational A^-body 
simulations, but the backbone simulation environments are not very different. 

The main difficulty in simulating self gravitating systems is the lack of anti- 
gravity, which results in the requirement of global communication; each object 
feels the gravitational attraction of any other object. 



The first astr onomical simula tion of a self gravitating A^-body system was 
carried out by lHolmbergl (Il94ll ) with the use of 37 light bulbs and photoelectric 
cells to evaluate the forces on the individual objects. Holmberg spent weeks in 
order to perform this quite moderate 37-particle simulation. Over the last 60 
or so years many different techniques have been introduced to speed up the 
kernel calculation. Today, such a calculation requires about 50 000 integration 
steps for one dynamical time unit. At a speed of ~ 10 GFLOP/s the calculation 
would be performed in a few seconds. 



The gravitational A^-body problem has made enormous advances in the last 
decade due to algori t hmic design. The introduction of digital computer s in the 
arena (jvon Hoernerl . Il963l : lAarseth &: Hoyld . Il964j : Ivan Albadal . Il968l ) led to 
a relatively quick evaluation of mutual particle forces. Advanced integration 
techniques, introduced to turn the particle forces in a pre dicted space-time 
traje c tory, opened th e way to predictable theoretical results (lAarseth fc Lecarl . 
19751 : lAarsethl . 119991 ). One of the major developments in the speed-up and 
improved accuracy of the direct A^-body problem was the introductio n of the 
block-time step algorithm (IMakinol . Il99ll : iMcMillan fc Aarsethl . Il993l ). 



In the late 1980s it became q uite cle a r that the advances of modern computer 
technology via Moore's law (iMoord. Il965l) was insufficient to simulate large 
star clusters by the new decade flMakino fc Hutl . Il988l . Il990h . This realization 
brought forward the initiatives employed around the devel opment of special 



hardware for eva l uating the forces between the partic l es (lApplegate et al 



1986 



Taiii et all . Il996l : iMakino fc Taiii 1 19981 : iMakind . l200ll : iMakino et al 



2003f) ■ and of the efficient use of assembler code on general purpo se hardware 
fiNitadori. Makino. fc Hutl . l2006l : iNitadori. Makino. fc Abl . l2007h . 
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One method to improve performance is by parallelising force evaluation Eq. 1^ 

for us e on a Beowulf or cluster computer (with or without dedicated hardware) (IHarfst et al. 



20061 ) ■ a large parallel supercomputer (|Makind . l2002l : iDorband et all l2003l ) 



or for grid operations (iGualandris et al.l . 120071 ) . In particular for distributed 
hardware it is crucial to implement an algorithm that limits communication 
as much as possible, otherwise the bottleneck simply shifts from the force 
evaluation to interprocessor communication. 



A breakthrough in direct-summation A^-body simulations came in the late 
1990s with the developm ent of the GRAPE series of special-purpose computers 
(jMakino &: Taijil . Il998l ). which achieve spectacular speedups by implementing 
the entire force calculation in hardware and placing many force pipelines on a 
single chip. The latest special purpose computer for gravitational A^-b ody sim- 
ulatio ns, GRAPE-6, performs at a peak speed of about 64 TFLOP/s (IMakind . 
200lh . 



In our standard setup, one GRAPE-6Af processor board is attached to a 
host workstation, in much the same way that a floating-point or graphics 
accelerator card is used. We use a smaller version: the GRAPE-6Af which has 
four chips connected to a personal workstation via the PCI bus delivering a 
theoretical peak performanc e of ~ 131 GFLOP/s for systems of up to 128k 
particles at a cost of ~ $6K (IFukushige et al.l . l2005l ). Advancement of particle 
positions [0(A^)] is carried out on the host computer, while interparticle forces 
[0{N^)] are computed on the GRAPE. 

The latest developments in this endeavour is the design and construction of 
the GRAPE-DR, the special pur pose computer which will break the PFLOP/s 
barrier by the summer of 2008 flMakinol . 120071 1^. One of the main arguments 
to develop such a high powere d and re l atively diverse computer is t o perform 
simulations of entire galaxies ( Making . 2005al : [Hoekstra et al . 2007). 



The main disadvantages of these special purpose computers, however, are the 
relatively short mean time between failure, the limited availability, the limited 
applicability, the limited on-board memory to store particles, the simple fact 
that they are basically build by a single research team led by prof. J. Makino 
and the lack of competing architectures. 

The gaming industry, though not deliberately supportive of scientific research, 
has been developing high power parallel vector processors for performing 
specific rendering applications, which are in particular suitable for boost- 
ing the frame-rate of games. Over the last 7 years graphics processing units 
(GPUs) have evolved from fixed function hardware for the support of prim- 
itive graphical operations to programmable processors that outperform con- 
ventional CPUs, in particular for vectorizable parallel operations. Regretfully, 



See http : //grape . astron . s . u-tokyo . ac . jp/grape/ computer/ grape-dr . html 
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the precision of these processors is still 32-bit IEEE which is below the av- 
erage general purpose processor, but for many applications it turns out that 
the higher (double) precision is not crucial or can be emulated at some cost. 
It is because of these developments, that more and more people use the GPU 



for w i der purposes than just for graphics (iFernandd . |2004J : iPharr &: Fernando 



20051 : iBuck et al.l . |200J) . This type of programming is also called general pur- 
pose computing on graphics processing units (GPGPUl!]. Earlier attempts to 
use a GPU for gravitational N-body simulations were carr ied out using approx - 



imate force evaluation methods with shared time steps fiNvland et al l2004h 



but provide little improvement in performance. A 25-fold speed increa s e com - 



pared to an Intel Pentium IV processor was reported by lElsen et al.l (120061 ) 



but details of their implementation of th e force evaluation algorithm are yet 



unclear. Recently, iHamada et al.l (120071 ) proposed the 'Chamomile' scheme 



for running A^-body simulations with a shared time-step algorithm on GPUs. 
Though, their method, using the CUDA programming environment, outper- 
forms our implementations, the shared time step renders their code unpractical 
for simulating dense star clusters. 

Using GPUs as a general purpose vector processor works as follows. Colours 
in computer graphics are represented by one or more numbers. The luminance 
can be represented by just a single number, whereas a coloured pixel may 
contain separate values indicating the amount of red, green and blue. A fifth 
value alpha may be included to indicate the amount of transparency. Using 
this information, a pixel can be drawn. For general purpose computing, the 
colour information of a pixel is used to represent attributes of the computation. 
There are many pixels in a frame, and ideally, these should be updated all at 
the same time and at a rate exceeding the response time of the human eye. 
This requires fast computations for updating the pixels when, for example, 
a camera moves or a new object comes into view. Such operations usually 
have an impact on many or even all pixels and therefore fast computations 
are required. As the majority of pixels do not require information from other 
pixels, processing can be done efficiently in parallel. All information required 
to build a pixel should go through a series of similar operations, a technique 
which is better known as single instruction, multiple data (SIMD). There are 
many different kinds of operations this information needs to go through. The 
stream programming model has been designed to make the information go 
through these operations efficiently, while exposing as much parallelism as 
possible. The stream programming model views all informations as "streams" 
of ordered data of the same data type. The streams pass through "kernels" 
that operate on the streams and produce one or more streams as output. 

In this paper we report on our endeavour to convert a high precision produc- 
tion quality A^-body code to operate with graphics processing units. In § 2 we 



see http : //www . gpgpu . org 
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explain the adopted A^-body integration algorithm, in § 3 we address the pro- 
gramming environment we used to program the GPU. In the sections §4 and 
§ 5 we present the results on two GPUs and compare them with GRAPE-6Af 
and we discuss a model to explain the GPU's performance. In § 6 we sum- 
marise our findings, and in the Appendix we present a snippet of the source 
code in Cg. 



2 Calculating the force and integrating the particles 

The gravitational evolution of a system consisting of N stars with masses nij 
and at position rj is computed by the direct summation of the Newtonian 
force between each of the stars. The force Fj acting on particle i is then 
obtained by summation of all other — 1 particles 

Here G is the Newton constant. For further readabihty we omit the particle 
index i and present vectors in boldface. 

A cluster consisting of N stars evolves dynamically due to the mutual gravity 
of the individual stars. For an accurate force calculation on each star a total 
of IN{N — 1) partial forces have to be computed. This C(iV^) operation is 
the bottleneck for the gravitational A'"-body problem. 

The GPU scheme described in this paper is implemented in the iV-body inte- 
grator. Here particle motion is calculated using a fourth-order, individual-time 
step "Hermite" predictor-corrector scheme (Makino and Aarseth 1992). This 
scheme works as follows. During a time step the positions (x) and velocities 
(v = x) are first predicted to fourth order using the acceleration (a = x) and 
the "jerk" (k = a, the time derivative of the acceleration) which are known 
from the previous step. 

At the start of each simulation an initial time step is calculated 

dt = ly^. (2) 
k 

Here we introduce u as an accuracy control parameter {v — 0.01 for most of 
our simulations, see also Eq. 9). 

The predicted position (xp) and velocity (vp) are calculated for all particles 
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p I 2 6 

Vp = V + Sidt + -k(it^. 



(3) 
(4) 



The acceleration (a^) and jerk (kp) are then recalculated at the predicted time 
from Xp and Vp using direct summation. Finally, a correction is based on the 
estimated higher-order derivatives: 



a = -6Aa/dt2 _ (4k + 2lip)/dt, (5) 
k=12Aa/rft^ + 6(k + kp)/dtl (6) 

Here Aa = a — a^. Which then leads to the new position and velocity at time 
t + dt. 



x = Xp H — dr H dr, 

^ 24 120 ' 

a k 

w=wp + -de + —dt\ 



(7) 



The new timestep is calculated using a new predicted second d erivative of the 



acceleration ap = a + krft for each particle i individually with (lAarsethl . Il985l ) 



1/2 



dt 



Ikllkl 



a? 



(9) 



Here we use for accuracy parameter v = 0.01. 

A single integration step in the integrator proceeds as follows: 

• Determine which stars are to be updated. Each star has an individual time 
(ti) associated with it at which it was last advanced, and an individual time 
step (dti). The list of stars to be integrated consists of those with the smallest 
ti + dti. Time steps are constrained to be p owers of 2, allowing "blocks " of 



many stars to be advanced simultaneously (iMcMillan &: Aarsethl . Il993l ). 
Before the step is taken, check for system reinitialization, diagnostic output, 
termination of the run, storing data. 

Perform low-order prediction of all particles to the new time ti + dti. This 
operation may be performed on the GPU or GRAPE, whatever is available. 
Recompute the acceleration and jerk on all stars in the current block (using 
the GPU or GRAPE, if available), and correct their positions and velocities 
to fourth-order. 
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Table 1 

Detailed information on the hardware used in our experiments. The first column 
gives the parameter followed by the four different hardware setups (GRAPE-6Af, 
GeForce 8800GTX, Quadro FX1400 and ir iformation about the host computer. The 
information for the GRAPE is taken from Making et al. I (l2003l ). the GPU informa- 



tion is from http://www.nvidia.com. The hardware details are the number of pro- 
cessor pipelines (npipe), the processor's clock frequency (/cycle = l/^cycie)) the mem- 
ory bandwidth for communication between host and attached processor (1/tbus)) 
the amount of memory (in number of particles, one particle requires 84 bytes, here 
we adopt Ik = 1024). For measured hardware parameters, see Tab. 3. Note that the 
measured internal communication speed between the device memory and the GPU 
for the 8800GTX and the FX1400 are 86.4 Gbyte/s and 19.2 Gbyte/s, respectively. 



data 


GRAPE-6Af 


8800GTX 


FX1400 


Xeon 


unit 




24 


128 


12 


1 




/cycle 


90 


575 


350 


3400 


MHz 


lAbus 


0.133 


2.0 


2.0 


NA 


GB/s 


Memory 


128k 


9362k 


1562k 




particles 



Note that this scheme is rather simple as it does not include treatment for 
close encounters, binaries or higher order (hierarchical or democratic) stable 
multiple systems. 



3 The programming environment 



The part of the algorithm that executes on the GPU (the for ce evaluation) is 
i mplem ented in the Cg programming language (C for graphics. iFernando fc Kilgard 
([20031), see Appendix A), which has a syntax quite similar to C. The Cg pro- 
gramming environment includes a compiler and run-time libraries for use with 
the open graphics library (OpenGL)0 and DirectxQ graphics application pro- 
gramming interfaces. Though originally developed for the creation of real-time 
special effects without the need to program directly to the graphics hardware 
assembly language, researchers soon recognised the potential of Cg and started 
to apply it not only to high-performance graphics but also to a wide variety of 
"gene ral-purpose computing" problems (iFernandd . |2004| ; iPharr &: Fernando! . 



20051 ). 



^ see 'http : / /www . opengl . or^ 

^ see http : //www .microsoft . com/directx 
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3.1 Mapping the N-hody problem to a GPU 

The challenge in the implementation of an efficient A^-body code on a GPU 
lies in the mapping of the algorithm and the data to graphical entities sup- 
ported by the Cg language. Particle data arrays are represented as "textures" . 
Normally, textures are used to represent pixel colour attributes with one sin- 
gle component (luminance, red, green, blue or alpha), three components (red, 
green and blue) or four components (red, green, blue and alpha). In our im- 
plementation we use multiple textures to represent the input and output data 
of N particles, as follows: 

• Input: position (3iV), velocity (3iV), mass (A'"), acceleration (3A^), jerk {^N) 
and potential [N). 

• Output: acceleration {^N), jerk {?>N) and potential {N) 

All values are represented as single precision (32-bit) floating point numbers 
for a total of 21 floats or 84 bytes per particle. In Appendix A we present a 
snippet of the source code in Cg, showing the implemented force evaluation 
routine. With the 768 Mbyte on-board memory of the GeForce 8800GTX it 
can store about 9 million particles, whereas the GRAPE-6Af can store only 
128k (see Tab.l). 

Transferring data from CPU to GPU is accomplished through the deflnition 
of textures, which can either be read-only or write-only, but not both at the 
same time. The data structures in the CPU are then copied onto appropri- 
ately deflned textures in the graphics card's memory. Obtaining the results 
from GPU to CPU is done by reading back the pixels from the appropriate 
rendering targets into data structures on the host CPU. Therefore the output 
textures (acceleration, jerk and potential) are represented by a double-buffered 
scheme, where after each GPU computation the textures are swapped between 
reading and writing. There is some additional overhead (of order 0{N)) for 
this operation which has to be performed every block time step. 

Conventionally, graphics cards render into a "frame buffer" , a special memory 
area that represents the image seen on a display. However, a frame buffer is 
unsuitable for our purposes as the data elements in this buffer are "clamped" 
to value ranges that map the capabilities of the display. Invariably this means 
that 32-bit real vectors are reduced in resolution and therefore in accuracy 
too. This is perfectly fine for visual displays where the number of colours af- 
ter clamping are still 2^^ (fa 16 million), sufficient to make two neighbouring 
colours indiscernible to the human eye. However, this is unacceptable for scien- 
tific production calculations. The workaround is to create an off-screen frame 
buffer object and instruct GPU programs to render into these rather than to 
the screen. Off-screen frame buffers support 32-bit floating point values and 



8 



are not clamped and therefore preserve their precision. 



The GPU has two main kernel operations available in programmable graphics 
pipelines, these are a "vertex shader" and a "fragment shader". Our imple- 
mentation only makes use of the fragment shader pipeline as it is better suited 
for the kind of calculations in the N-body problem and because the fragment 
pipeline in general provides more processing poweiT^. The host CPU is respon- 
sible for allocating the input textures and frame buffer objects, copy the data 
between CPU and GPU, and binding textures that are to be processed by 
kernels. The lower order prediction and correction of the particle positions is 
also done on the host CPU. In Tab. 1 we summarise the hardware properties 
of the two adopted CPUs and the GRAPE. 



4 Results 



To test the various implementations of the force evaluator we perform several 
tests on different hardware. For clarity we perform each test with the same 
realization of the initial conditions. For this we vary the number of stars from 
N = 256 particles with steps of two to half a million stars (see Tab. 2). Not 
each set of initial condition is run on every processor, as the Intel Xeons, for 
example, would take a long time and the scaling with N is unlikely to change 
as we clearly have reached the CPU-limited calculation regime (see §5). 

The initial conditions for each set of simulations were genera ted by randomly 



selecting the stellar positions and velocities according to the iPlummerl (jl91ll ) 



distribution using the method described by lAarseth et al.l (119741 ). Each of the 
stars were given the same mass. The initial particle representations were scaled 
to virial equilibrium before starting the calculation. 

Each set of initial condi tions is run from t = to t = 0.50 time units 
( Heggie &: Mathieu . 1986 jZ, but the performance is measured only over the 



last quarter of an A^-body time unit to reduce the overhead for reading the 
snapshot and the initialisation of the integrator. The maximum time step 
for the particles was 0.125, to guarantee that each particle was evaluated at 
least twice during the course of the simulation. For the minimum timestep we 
adopted 1/2^^, and all time steps were calculated using u = 0.01 in Eqs. 2 and 
9. The force calculations were performed by adopting a softening of 1/256 for 
all simulations. 



^ Before the 8800GTX family of GPUs, vertex programs and fragment programs 
had to execute on distinct processing units. The 8800GTX is the first generation of 
GPUs where this distinction no longer exists and the two are unified. 



see also http : //en . wikipedia . org/ wiki/Natural_unit s#N-body_miit s 



Table 2 

Results of the performance measurements for a Plummer sphere with N equal mass 
particles initially in virial equilibrium for 0.25Ar-body time units (from t = 0.25 
to t = 0.5) using a softening of 1/256. In the first column we list the number of 
particles, followed by the timing results of the GRAPE in seconds. In the last column 
we give the timing results for the calculation without an attached processor. The 
GRAPE (second column) was measured up to 64k particles, because the on-board 
memory did not allow for larger simulations. Simulations on the FX1400 and the 
host computer were limited to the same number for practical reasons. 



N 


GRAPE-6Af 


8800GTX 


FX1400 


Xeon 


256 


0.07098 


2.708 


3.423 


0.1325 


512 


0.1410 


8.777 


10.59 


0.5941 


1024 


0.3327 


17.46 


20.20 


2.584 


2048 


0.7652 


45.27 


54.16 


10.59 


4096 


1.991 


128.3 


157.8 


50.40 


8192 


5.552 


342.7 


617.3 


224.7 


16384 


16.32 


924.4 


3398 


994.0 


32768 


51.68 


1907 


13180 


4328 


65536 


178.2 


3973 


40560 


19290 


131072 




8844 






262144 




22330 






524288 




63960 







4-1 Performance measurements 



For our performance measurements we have used four nodes of a Hewlett- 
Packard xw8200 workstation cluster, each with a dual Intel Xeon CPU running 
at 3.4 GHz and either the GRAPE, a Quadro FX1400 or GeForce 8800GTX 
graphics card in the PCI Express (16x) bus. The cluster nodes were running a 
Linux SMP kernel version 2.6.16, Cg version 1.4, graphics card driver version 
1.0-9746 and the OpenGL 2.0 bindings. 

In Figure 1 we show the timing results of the A^-body simulations. The FX1400 
is slower than the general purpose computer over the entire range of A^ in our 
experiments. The bad performance of the FX1400 is mainly attributed to the 
additional overhead in communication and memory allocation. For A^ ^ 10^ 
the GeForce 8800GTX GPU is slower than the host computer but continues 
to have a relatively flat scaling, comparable to the GRAPE-6, whereas the 
host has a much worse (oc A^^) scaling. The scaling of the compute time of 
the GPU is proportional to that of the GRAPE (oc A^^/^), but the latter has 
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Fig. 1. Timing of several implementations of the gravitational A^-body simulations 
for N = 256 particles to = 512k particles (only for the 8800GTX, the others 
up to 64k) over one A^-body time unit. The 8800GTX are represented with open 
circles connected with a solid curve, the GRAPE is given by bullets with dashed 
line. The thin dashed (triangles) line and thin dotted (squares) lines give the results 
of the calculations with the FX1400 and with only the host computer. Note that 
the timings in Tab. 2 were multiplied by a factor of four to estimate the compute 
time for one dynamical time unit, rather than the 1/4*^ over which the timing 
calculations were performed. 

a smaller offset by about an order of magnitude. This is mainly caused by 
the efficient use of the GRAPE pipeline, which requires fewer clock cycles per 
force evaluation compared to the GPU (see §6). 



4-2 Accuracy measurements 



The GPU has a lower accuracy compared to the GRAPE or the host work- 
station. The GRAPE-6 uses 24-bit mantissa for calculating the differential 
position, and 64-bit fixed point format for accumulation. The pipeline for the 
time derivative is designed wit h 20-bit mantissa an d 32-bit fixed-point nota- 



tion for the final accumulation (iMakino et al.l . l2003l ) 



With the literature value of the mantissa the NVIDIA architecture should at 
most be able to reach an accuracy in relative energy {\AE\/E) of about 1/10^, 
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Fig. 2. Absolute value of the relative error in the energy as a function of the timestep 
control parameter v. For clarity we only present the result for simulations with Ik 
(thick curves with circles) and 16k (thin curves with bullets) particles with the 
8800GTX (solid curves) and GRAPE (dashes). The simulations were performed 
over 0.25 N-body time units using identical input realizations as adopted for the 
performance measurements. 



whereas the GRAPE will be able to reach much higher accuracy. We tested 
this by running the initial conditions for Ik particles and 16k particles on the 
8800GTX as well as on the GRAPE for a range of accuracy parameters (z/ in 
Eqs. 2 and 9) ranging from v = 0.1 (very low accuracy, but fast calcula tion) 
down to u = 0.1/2*^ (very accurate but slow calculation) (Aarseth, 19851 ). 



In fig. 2 we present the degree to which energy E is conserved (in units 
of \AE\/E) for the simulations running on GRAPE (dashed lines) and the 
8800GTX (solid curves) as a function of the accuracy parameter u. For rel- 
atively large values of u ^ 0.02 the GRAPE and 8800GTX produce similar 
energy errors, indicating that we are in the regime where the accuracy is lim- 
ited by the integrator. For smaller values of z/, however, the GRAPE has far 
superior energy conservation compared to the GPU. The error for the GRAPE 
continues to decrease for smaller values of z/, until about \AE\/E ~ 10^^^, 
whereas for the GPU the energy error does not drop below \AE\/E ~ 10~^. 
Note that for all the performance calculations in §4.1 we adopted u = 0.01, 
which produces about maximum accuracy achievable for the GPU. For the 
GRAPE, however, we could have used much smaller integration time steps. 
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resulting in considerably smaller energy errors. The price to pay, however, 
would be a much longer compute time. 



5 Performance modelling of the GPU 



In modelling the performance of t he GPU we adopt the model proposed by 



Makind (120021 ): iHarfst et al.l (120061 ) but tailored to the host plus GPU and to 



the GRAPE architecture. 

The wall clock time required for advancing the Tibiock particles in a single block 
time step in the A^-body systems is 

^step ^host ~l~ ^forcc ~l~ ^comirf 

(10) 



Here thost = ^pred + ^corr is the time spent on the host computer for predicting 
and correcting the particles in the block, tforce is the time spent on the attached 
processor and tcomm is the time spent communicating between the host and the 
attached processor. We now discuss the characteristics of each of the elements 
in the calculation for tstep- 



Host operation. The predictions and corrections of the particles are calcu- 
lated on the host computer, and the time for this operation is directly related 
to the speed of the host processor tcpu, the number of operations in the predic- 
tion step ?7pred and in the correction step r^corr- The total time spent per block 
step then yields 

^pred — '^pred^cpu ("^"^) 



for the prediction and 



^corr — '7corr^cpu'^block; 

(12) 



for the correction. The number of clock cycles per prediction step rypred — 900 
and for the correction r/corr — 16000. This operation could be performed on 
the GPU, though the GRAPE is not designed for the predictor and corrector 
calculation. For a fair comparison between the GRAPE and the GPU and in 
order to preserve high accuracy we performed these calculations on the host. 



Communication. The time spent communicating between the host and the 
attached processor is expressed by the sum of the time needed to send Usend 
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particles to the acceleration hardware and the time needed to receive rirec 
particles from the acceleration hardware: 

^comm ^send^send^send ~l~ ^rec^rec^send • 

(13) 



Here r^send^send and ?7rec^rec are the time needed to send and receive one par- 
ticle, respectively. For the computation without the hardware acceleration 
^comm = 0, since the forces between all particles are calculated locally. For 
the GRAPE and the GPUs, however, a considerable amount of time is spent 
in communication. For the GRAPE sending data is equally fast as receiving 
data, i.e: tgend = W = ^bus- Sending data to the GPU is considerably slower 
than receiving data (see Tab. 3). 

The two send and receive efficiency factors ?7send and rjj-ec, are the product 
of the overhead rjo and the number of by tes per particle that has to be 



sent or received. The overhead rjo = 188 (iFukushige et al.l . |2005| ) for each 
of the attached processors. Since for the GRAPE the send and receive oper- 
ation are equally expensive we can just count the number of bytes that has 
to be transported per particle, which for the GRAPE hardware is 72 bytes 



(IFukushige et all . l2005l : iHarfst et al.l . 120061 ). For the GRAPE we then write 



'7send^send ~l~ ^rec^rec 72 X 188tbus- 

For the GPU ?7send > Vrec (see Table 3 for the measured values). The additional 
overhead rjo is the same as for the GRAPE, but per particle the number of 
bytes to send is different from the number to receive. As we discussed in §3.1 
a total of 56 bytes has to be sent from the host to the GPU, whereas only 
28 bytes are received. For the GPU we then write r^send = 56 x 188, whereas 

T]rec = 28 X 188. 

In addition to the difference in the speeds for sending and receiving data, the 
GPU suffers from an additional penalty. The GRAPE sends the particles in 
the block (rigend = '^biock), whereas due to internal memory management the 
GPU has to send and receive all particles nsend = N (see § 3.1). This efficiency 
loss is quite substantial, but is reduced when we use CUDA as programming 
environment (see §6). 

For the adopted (Hermite predictor-corrector block time-step) integration 
scheme the number of particles in a single block nbiock cannot be determined 
implicitly, though theoretical arguments suggest nbiock oc A^^''^. Instead of us- 
ing this estimate, we fitted the average number of particles in a block time 
step. This fit was done with the equal mass Plummer sphere initial conditions 
running on GRAPE and run over one dynamical (N-body) time unit. The 
average number of particles in a single block is then 



n-block 



0.20A^°-*^ (14) 
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Calculation. The time spent by the hardware acceleration (tforce) is directly 
related to the speed of the dedicated processor (tcycie), the number of pipelines 
per processor (/^pipe) and the number of operations for one force evaluation 
(r^fc ~ 60). 

^force — '7fe-^''^block^cycle/'^pipe- i^^) 



The details of the different hardware are presented in Tab. 1 and the measured 
values are in Tab. 3 The GRAPE has a vector pipeline for each processor 
which allows a more efficient force evaluation than the GPU, the number of 
operations per force evaluation for the GRAPE is therefore r^fe — 0{1) 

In order to enable hardware acceleration on our A^-body code we had to in- 
troduce a number of additional operations, like reallocating arrays, which give 
rise to an extra computation overhead. For the calculations with the host 
computer without hardware acceleration we adopt r/fe — 180, a factor of three 
larger than for the GPUs. 



Total performance. The total wall-clock time spent per dynamical (N- 
body) time unit is then 

t '^steps^step- (-^6) 



Here we fitted the number of b lock steps p er dynamical (N-body) time units. 
According to Makino fc Hut ( 1988 . IQQOt ) ngteps oc n^/^. We measured the 



number of block time steps using the equal mass Plummer distributions as 
initial conditions, using the GRAPE enabled code and fitted the result: 



steps — 



2A7N 



0.35 



(17) 



In Fig. 3 we compare the results of the performance model with the measure- 
ments on the workstation without additional hardware (squares) and with 
three attached processors; a single GRAPE-6Af processor board (bullets), an 
FX1400 (triangles) and the newer GeForce 8800GTX (circles). Note that the 
measurements in Tab. 2 were multiplied by a factor four to compensate for 
the fact that we performed our timings only over a quarter A^-body time unit. 
Though these curves are not fitted, they give a satisfactory comparison. 

The largest discrepancy between the performance model and the measure- 
ments can be noticed for the FX1400 GPU, which, for A^ ^ 10^ seems to per- 
form considerably less efficient than expected according to the performance 
model. Part of this discrepancy, though not explicitly mentioned in § 5 is 
the result of a hysteresis effect in the communication of both GPUs. For the 
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Fig. 3. The results of the above described performance model (thick lines) over-plot- 
ted with the results of the measurements for the three attached processors (symbols). 
The bullets represents the results from a single GRAPE-6Af processor board, the 
squares give the host workstation, the circles are for the GeForce 8800GTX and the 
triangles give the FX1400 graphics processor. 

8800GTX, however, this effect is less evident, but still present. Both GPUs 
tend to have a maximum communication speed for blocks of 0.5 Mbyte (about 
6000 particles). An additional effect which causes performance loss on the 
FX1400 is the increase in the number of block time steps. This number con- 
tinued to increase beyond our measurements performed with GRAPE (see 
Eq. 14). 

The numbers listed in Table 3, and used in our performance model, are the 
optimum values. The communication speed drops by about a factor of two for 
much larger amounts of data transfer to and from the GPU. For the FX1400, 
this drop in communication is considerable, whereas for the 8800GTX it results 
in a smaller performance loss (mainly due to the larger number of processor 
pipelines). The discrepancy for the GRAPE calculation with low N is the 
result of neglecting the limited size of the processor pipeline in the performance 
model and due to the irregular behavior of the number of particles in each 
block time step. 

In Fig. 4 we present the speed-up for the various hardware configurations, 
compared to running on the host workstation. Here it is quite clear that for 
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Table 3 

Time measurements (in seconds) for the various hardware operations used in this 
paper. The first hne gives the time spent by the host computer for predicting and 
correcting a single particle. The second row is for calculating the force between two 
particles. The last two rows give the time to send a single particle to, and to receive 
a single particle from the attached hardware. For the calculations with only the host 
computer this operation is not available. In particular the communication with the 
GPUs turns out to be relatively slow. 



par am 


GRAPE-6Af 


8800GTX 


FX1400 




Xeon 


^host 


3.82 X 10-'^ 


3.82 X lO"'^ 


3.82 X 10" 


-7 


3.82 X lO"'^ 


^fe^cycle/^pipe 


4.63 X 10"^° 


8.15 X 10"^° 


1.43 X 10- 


-8 


5.29 X 10-*^ 


^send^send 


8.00 X lO"'^ 


1.76 X 10"^ 


1.89 X 10" 


-5 


NA 


^rec^rec 


8.00 X 10-"^ 


5.97 X 10"^ 


5.98 X 10" 


-6 


NA 



low the GPUs do not give a appreciable speedup, but for a large number 
of particles, the GeForce 8800GTX gives a speedup of at least an order of 
magnitude, but not as much as the GRAPE. The latter, however, will not be 
able to perform simulations of more than 128k particlea ^ I. 



6 Discussion 



We have successfully implemented the direct gravitational force evaluation 
calculation using Cg on two graphics cards, the NVIDIA Quadro FX1400 and 
the NVIDIA GeForce 8800GTX, and compared their performance with the 
host workstation and the GRAPE-6Af special purpose computer. 

For N ^ 10^ particles the workstation outperforms the GPUs. This is mainly 
due to additional overhead introduced by the communication to the GPU 
and memory allocation on the GPU. For a larger number of particles the 
more modern GPU (8800GTX) outperforms the workstation by up to about 
a factor of 50 (for 9 million particles). Such a large number of particles cannot 
be simulated on the GRAPE-6Af, due to memory limitations. For up to 256k, 
the maximum number of particles that can be stored on the GRAPE, the 
8800GTX is slower than the GRAPE by a factor of a few. Still, at this particle 
number the GPU is faster than the workstation by an order of magnitude. 

For the adopted accuracy u = 0.01, the average mean error in the energy 
measured over 0.25N-body time unit is \AE\/E = (1.7 ± 1.6) x 10~^ for 



^ Due to a defective chip on our GRAPE-6 the on-board memory was reduced 
from 128k particles to 64k particles. The latest GRAPE-6Af are equipped with 
256k particles of memory. 
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Fig. 4. The speedup of the two GPUs and the GRAPE with respect to the host 
workstation as a function of the number of particles. The (lower) dotted curve is 
for the Quadro FX1400, the solid curve (middle) gives the timing for the GeForce 
8800GTX and the top line (dashes) represents the GRAPE. 

the 8800GTX and (5.1 ± 0.56) x IQ-^ for the FX1400 (averaged over the 

simulations for N = 256 to iV = 64k), whereas for the GRAPE we measured 
(1.9 ± 1.2) X 10"''', which is comparable to the mean error on the host. For the 
adopted accuracy, both the host and GRAPE produce an energy error which 
is about an order of magnitude smaller than that of the GPUs. For smaller 
values of u the energy errors in the GRAPE continue to decrease whereas 
for the GPU this is not the case, as we have reached the precision of the 
hardware. For many applications an energy error of |A£^|/£^ = O(10~^) may 
be satisfactory. 

In the release notes of CUDA version 0.8, NVIDIA announced that GPUs 

supporting 64-bit double precision floating point arithmetic in hardware will 
become available in late 2007. In the meantime, we could improve the accuracy 
of the GPU by sorting the forces on size before adding them, summing the 
smallest forces first. 

The GRAPE-6 is much more efficient in using its clock cycles, allowing effec- 
tively one operation per clock cycle, whereas the NVIDIA architecture requires 
more cycles. This turns out to be an important reason why the 8800GTX is 
slower than the GRAPE-6. 
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The main advantage of the GPU over that of the dedicated GRAPE hardware, 
is the much larger memory, the wider apphcabihty and the much lower cost 
of the former. The large memory on the GPU allows simulations of up to 
about 9 million particles, though one has to wait for about two years for one 
dynamical time scale. 

In theory the 8800GTX should be able to outperform the GRAPE-6Af, but 
due to relatively inefficient memory access and additional overhead cost, which 
is not present in the GRAPE hardware, many clock cycles seem to get lost. 
With a more efficient use of the hardware the GPU could, in principle, improve 
performance by about two orders of magnitude. For the next generation of 
GPUs we hope that this efficiency bottleneck will be lifted. In that case, 
the GPU would outperform GRAPE by almost an order of magnitude. Note, 
however, that the GRAPE-6 is based on 5 year old technology, and the next 
generation GRAPE is likely to outperform modern GPUs by a sizable margin. 

These current bottlenecks in the GPU may be reduced using the Compute 
Unified Device Architecture (CUDA0 programming environment, which is 
supposed to provide an improved environment for general purpose program- 
ming on the GPU. In Fig. 5 we present the possible future performance as- 
suming that the additional communication overhead on the GPU is lifted, the 
clock cycles are used more efficiently without any assumptions of improved 
hardware speed. In the first step we simple reduce communication to blocks 
rather than having to transport all particles each block time step (solid curve). 
This relatively simple improvement has re cently been carried out using CUDA 



( iHamada et al.l . 120071 : iBedorf et al.l . 120071 ). The second optimalization (dashed 



curve in Fig. 5) is achieved when, in addition to reducing the communication 
we also carry the predictor and corrector steps to the GPU. This improvement, 
however, may be associated with a quite severe accuracy penalty. For both im- 
provements we used the performance data for the current design 8800GTX. 
Further improvement can be achieved when, in addition to more efficient com- 
munication and force computing pipeline in further optimized. The result of 
this hypothetical case would improve performance by more than a factor 100 
compared to the workstation over the entire range of N. 
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Appendix A 



The N-body code presented in this paper consists of a part implemented in C 
(running on a CPU) and a part implemented in Cg (running on the GPU) . In 
this appendix we show the routine that evaluates the acce leration, jerk an d 
potential in Cg (which was based on a tutorial available from lGoddekd ( 20051 )). 
The C code which handles communication between CPU and GPU and sup- 
porting data structures is not presented here. A copy of the entire working 



version of the code is available via http://modesta.science.uva.nl 
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void conipute_acc_jerk_aiid_pot( 

in float2 coords : TEXCODRDO, 

out floats acc : CDLORO, 

out float4 jerkAndPot : COLORl, 
uniform samplerRECT accTexture, 
uniform samplerRECT j erkAndPotTexture , 
uniform samplerRECT massTexture, 
uniform samplerRECT posTexture, 
uniform samplerRECT velTexture, 
uniform float eps2, 
uniform float otherParticle, 
uniform float texSizeX, 
uniform float texSizeY, 
uniform float offset) 



// 2D texture coordinate of this particle 

// Output texture with acceleration 

// Output texture with jerk and potential 

// Input texture with all particles' acceleration 



// >> >» »> »> »> 

// >> >» »> »> »> 

// »> >» »> »> »> 

// >> »» »» »» »> 

// Softening parameter 
// Index to other particle 
// Width of all textures 
// Height of all textures 
// Number of unused texture elements 



jerk and potential 
mass 

position 
velocity 



float coordslD, newCoordslD, otherMass, 

r2, xdotv, r2inv, rinv, rSinv, rSinv 
float2 newCoords; 

floats pos, otherPos, vel, otherVel, dx, 



xdotvrSinv; 
dv, thisAcc, this JerkAndPot; 



// Get data from the textures 

acc = texRECT (accTexture, coords). rgb; 

jerkAndPot = texRECT(j erkAndPotTexture, coords) .rgba; 

pos = texRECT (posTexture, coords). rgb; 

vel = texRECT (velTexture, coords). rgb; 



// Convert the 2D texture coordinate to ID and increase with otherParticle 

// to obtain the coordinate of this iteration's other particle. Because our 

// textures are defined as samplerRECT, texture elements must be addressed 

// as (x+0.5,y+0.5) . When converting to ID, we must compensate for this offset). 

coordslD = round(coords.y-0.5)*texSizeX + round(coords.x-0.5) ; 

newCoordslD = coordslD + otherParticle; 



// Skip over unused texture elements 

if (newCoordslD + offset > texSizeX*texSizeY - 1) 

newCoordslD = newCoordslD - (texSizeX*texSizeY - offset) ; 



// Convert the other particle's ID coordinate to 2D. As above, we must add 
// 0.5 to obtain correct texture element coordinates. 
newCoords = 0.5 + float2( frac(newCoordslD/texSizeX)*texSizeX, 

floor (newCoordslD/texSizeX) ); 

// Get the position, velocity and mass of this iteration's other particle 
OtherPos = texRECT (posTexture, newCoords) .rgb; 
OtherVel = texRECT (velTexture, newCoords) .rgb; 
OtherMass = texRECT (massTexture, newCoords) .r; 



// Compute acceleration, jerk and potential 

dx = otherPos-pos ; 

dv = otherVel-vel ; 

r2 = eps2 + dot(dx,dx); 

xdotv = dot (dx , dv) ; 

r2inv = 1.0/r2; 

rinv = sqrt (r2inv) ; 

rSinv = r2inv*rinv; 

r5inv = r2inv*r3inv; 

xdotvr5inv = 3.0*xdotv*r5inv; 

thisAcc = 0therMass*r3inv*dx; 

this JerkAndPot = otherMass* (r3inv*dv - xdotvr5inv * dx) ; 
acc = acc + thisAcc; 

j erkAndPot . rgb = j erkAndPot . rgb + this JerkAndPot ; 
jerkAndPot. a = jerkAndPot. a - otherMass*rinv; 
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